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Symbols 

A nx n tridiagonal system 

A n x n matrix 

a, b real solutions of equation ( 1 5) 
d right side of the tridiagonal system 

E n x 2(p - 1) matrix 

T 

E transpose of matrix E 

m order of the submatrix (m = nip) 

n order of the matrix 

Ail number of right sides of the system 

p number of processors 

V nx 2{p - 1 ) matrix 

v vectors 

v', w l vectors 

x solution of the tridiagonal system 

x n x 1 vector 

x\ jc* notation introduced in accuracy analysis 

Y n x 2(p — 1) matrix 

Z 2 (p - 1 ) X 2 (p - 1 ) matrix 

a communication latency (start time) 

(3 transmission rate (bandwidth) 

A A A A = VE t 

A. off-diagonal elements of matrix A 

|i diagonal elements of matrix A 

0" 1 inverse of matrix () 

Abbreviations: 

ADI alternating direction implicit 

CFD computational fluid dynamics 

MFLOPS million floating-point operations per second 

MIMD multiple-instruction multiple-data 

PDD parallel diagonal dominant 

PDE partial differential equation 

RCD recursive doubling 

RISC reduced instruction set computer 

SIMD single-instruction multiple-data 

SOR successive over relaxation 

SPP simple parallel prefix 

OER odd-even reduction 

STT symmetric Toeplitz tridiagonal 
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Abstract 


The parallel diagonal dominant (FDD) algorithm is an efficient tridiagonal 
solver. This paper presents for study a variation of the PDD algorithm , the reduced 
FDD algorithm. The new algorithm maintains the minimum communication provided 
by the FDD algorithm , but has a reduced operation count. The PDD algorithm also 
has a smaller operation count than the conventional sequential algorithm for many 
applications. Accuracy analysis is provided for the reduced FDD algorithm for sym- 
metric Toeplitz tridiagonal ( STT ) systems. Implementation results on Langley's Intel 
P aragon and IBM SF2 show that both the PDD and reduced PDD algorithms are effi- 
cient and scalable. 


1.0. Introduction 

Distributed-memory parallel computers dominate 
today's parallel computing arena. These machines, such 
as the Kendall Square KSR-1, Intel Paragon, TMC 
CM-5, and the recently announced IBM SP2 and Cray 
T3D concurrent systems, successfully deliver high- 
performance computing power for solving certain of the 
so-called “grand-challenge” problems (ref. 1). Despite 
initial success, parallel machines have not been widely 
accepted in the production engineering environment. On 
a parallel computing system, a task has to be partitioned 
and distributed appropriately among processors to reduce 
communication cost and to achieve load balance. More 
importantly, even with careful partitioning and mapping, 
the performance of an algorithm might still be unsatisfac- 
tory because conventional sequential algorithms may be 
serial in nature and may not be implemented efficiently 
on parallel machines. In many cases, new algorithms 
must be introduced to increase parallelism and to take 
advantage of the computing power of the scalable paral- 
lel hardware. 

Solving tridiagonal systems is a basic computational 
kernel of many computational fluid dynamics (CFD) 
applications. Tridiagonal systems appear in multigrid 
methods, alternating direction implicit (ADI) method, 
wavelet collocation method, and in-line successive over 
relaxation (SOR) preconditioners for conjugate gradient 
methods (ref. 2). In addition to solving partial differential 
equations (PDE), tridiagonal systems also arise in digital 
signal processing, image processing, stationary time 
series analysis, and spline curve fitting (ref. 3). One 
direct motivation for developing an efficient kernel for 
solving tridiagonal systems at the National Aeronautics 
and Space Administration (NASA) is that the implicit 
systems of compact schemes (ref. 4), which are relatively 
new finite-difference schemes widely used in production 
codes at Langley Research Center and Ames Research 
Center, are tridiagonal. 

Intensive research has been carried out on the devel- 
opment of efficient parallel tridiagonal solvers. Many 


algorithms have been proposed (refs. 5, 6, and 7), includ- 
ing the recursive doubling reduction method (RCD) 
developed by Stone (ref. 8) and the cyclic reduction or 
odd-even reduction method (OER) developed by Hock- 
ney (ref. 9). In general, parallel tridiagonal solvers 
require global communications, which makes them inef- 
ficient on distributed-memory architectures. Recently, 
we have taken a new approach: to increase parallel per- 
formance by introducing a bounded numerical error. 
Two new algorithms, namely the parallel diagonal domi- 
nant (PDD) algorithm (ref. 2) and the simple parallel pre- 
fix (SPP) algorithm (ref. 10), have been proposed for 
multiple-instruction multiple-data (MIMD) and single- 
instruction multiple-data (SIMD) machines, respectively. 
These two algorithms take advantage of the fact that trid 
iagonal systems arising in compact schemes are diagonal 
dominant. Backed by rigorous accuracy analyses, the 
algorithms truncate communication and computation 
without degrading the accuracy of the calculations. 

In this paper, a new algorithm, the reduced PDD 
algorithm, is studied based on the same approach: 
increasing parallel performance by introducing a 
bounded numerical error. The reduced PDD algorithm, a 
variation of the PDD algorithm, maintains the minimum 
communication provided by the PDD algorithm, but has 
a reduced operation count. The reduced PPD algorithm 
also has a smaller operation count than the conventional 
sequential algorithm for many applications. The empha- 
sis of this study is on implementation issues and perfor- 
mance comparisons of the PDD and reduced PDD 
algorithm. Most of the theoretical results, including the 
introduction of the PDD and reduced PDD algorithm, 
can be found in reference 2. 

This paper is organized as follows. Section 2 pro- 
vides the background of the parallel PDD algorithm. Sec- 
tion 3 introduces the new algorithm, the reduced PDD 
algorithm. Section 4 gives an accuracy analysis for the 
reduced PDD algorithm. Experimental results on the 
Intel Paragon and IBM SP2 multicomputer are presented 
in section 5. Performance comparison of the newly 



proposed algorithm and other existing algorithms, and of 
the two parallel platforms are also discussed in this sec- 
tion. Section 6 provides concluding remarks. 

2.0. Parallel Diagonal Dominant (PDD) 
Algorithm 

A tridiagonal system is a linear system of equations 

Ax = d (1) 

where x = (jcj, x n ) T and d = d n ) T are 

n-dimensional vectors and A is a diagonally dominant 
tridiagonal matrix with order n : 


*0 c 0 

a, b x Cj 


A = 


= [a., b ( , c,] (2) 


3 n-2 b n- 2 c n-2 
a n- 1 K- I 


To solve equation (1) efficiently on parallel comput- 
ers, we partition A into submatrices. We assume that 
n = pm , where p is the number of processors available. 
The matrix A in equation (1) can be written as 

A = A + AA 

where A is a block diagonal matrix with diagonal sub- 
matrices Aj(i = 0 p - 1 ). The submatrices 

A ; (/ = 0, ..., p - 1) are m x m tridiagonal matrices. Let e t 
be a column vector with its ith (0 < i < n - 1 ) element 
being one and all the other entries being zero. We have 


AA = 


a m e ni> C m-\ e m-V a 2m e 2m' C 2m-\ e 2m-V *•’ 
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T 

m - 1 
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T 
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C (p - 1 )m - 1 e (p - 1 )m - 1 


= VE T 


€ (p~\)m-l 
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where both V and E are n x 2 (p - 1) matrices. Thus, we 
have 


A = A + VE t 

Based on the matrix modification formula originally 
defined by Sherman and Morrison (ref. 11) for rank-one 


changes, and assuming that all A/s are invertible, equa- 
tion (1) can be solved by 


* = A-’d = ( A + VE T y x d 

(3) 

x = A- x d-A-'V{I + E T A- x V)- x E T A~ x d 

(4) 

Let 


Ax - d 

(5) 

> 

II 

(6) 

h = E t x 

(7) 

z = i + e t y 

(8) 

-c 

II 

N* 

(9) 

z 

1! 

-5 

(10) 

Equation (4) becomes 


x = x - Ax: 

(11) 


In equations (5) and (6), x and Y are solved by the 
lower/upper (LU) decomposition method. By the struc- 
ture of A and V, this is equivalent to solving 

A ,(* (<) . * ,<0 - H ' (,) ] = a im e 0- c (i + 1 )m - V e m - l] ( 1 2 ) 

/ = 0, ...,p - 1. Here jt^) and are the ith block of x 
and d , respectively, and w^) are possible nonzero 
column vectors of the ith row block of Y. Equation ( 1 2) 
implies that we only need to solve three linear systems of 
order m with the same LU decomposition for each i 
0 = 0, 1). 

Solving equation (9) is the major computation 
involved in the conquer part of our algorithms. Different 
approaches have been proposed for solving equation (9), 
which results in different algorithms for solving tridiago- 
nal systems (ref. 5). The matrix Z in equation (9) has the 
form 


Z = 


1 w<W . 0 

ffl — ! 


4 1} 

i 

0 



0 


0 




1 0 w^- 2) 




V L P - | 2) 0 1 w m-l 2) 




0v^‘> 1 


2 



where for / = 0, 1 are solutions of equa- 

tion (12) and the l’s come from the identity matrix /. 
Here and throughout, the subindex indicates the compo- 
nent of the vector. In practice, especially for a diagonally 
dominant tridiagonal system, the magnitude of the last 
component of v^, j and the first component of 
wV\ may be smaller than machine accuracy when 
p « n . (See section 4 for detailed accuracy analysis.) 
In this case, and v$_ j can be dropped, and Z 
becomes a diagonal block system consisting of 
(p - 1)2 x 2 independent blocks. Thus, equation (9) can 
be solved efficiently on parallel computers, which leads 
to the highly efficient parallel diagonal dominant (PDD) 
algorithm. 


than the number of processors. The PDD algorithm has 
a larger operation count than the Thomas algorithm. 
However, for a sufficiently large number of processors 
and an efficient hardware platform, the computation/ 
communication ratio of the PDD algorithm is high 
enough to render its performance comparable to the best 
case performance of the Thomas algorithm which was 
achieved by solving multiple right-hand sides simulta- 
neously. The reduced PDD algorithm is proposed to fur- 
ther enhance computation because it has the same 
communication cost as the PDD algorithm but has a 
reduced operation count. For some applications, the 
reduced PDD may have a smaller operation count than 
the Thomas algorithm. 


Using p processors, the PDD algorithm consists of 

the following steps: 

Step 1 . Allocate Ay, <fi l \ and elements a im , c ^ + i) m _ i to 
the ilh node, where 0 < i < p - 3 . 

Step 2. Solve equation (12). All computations can be 
executed in parallel on p processors. 

Step 3. Send xfr\ v ^ from the *th node to the (i - l)th 
node, for / = 1, 1. 

Step 4. Solve 




f > 


( \ 

1 wW , 

m - 1 


y u 


xU) t 
m - 1 

o+l) , 

L v o 1 j 


V y 2' +1 ) 


-0+1) 
l *o J 


in parallel on the / th node for 0 <i<p- 2. Then 
send y 2 i from the ith node to the (/ + 1 ) node, for 
i = 0, 2. 

Step 5. Compute equations (10) and (11). We have 


Ax (i) _ [ v (0 w(0] 


>’2/-l 

yn 






In the last step, step 5, of the PDD algorithm, the 
final solution x is computed by combining the intermedi- 
ate results concurrently on each processor: 

xW = x( k ) - _ j vM - 


which requires 4(n - sequential operations and 4m 
parallel operations if p- nfm processors are used. The 
PDD algorithm drops off the first element of w, w 0 and 
the last element of v, v m _ j in solving equation (9). In 
reference 2, we showed that, for symmetric Toeplitz trid- 
iagonal systems (see eq. (14)), we have 


V 


1 

X(a + bZ™^b 2i ) 


sm - 1 m - 1 

£ b 1 ', b 2i /(-b) (~b) m 1 

M = 0 i = 0 


V 

) 


So, when m is large enough (see theorem 1 
for quantitative measurement), we may drop off 
7 + 1, 1, and w h i = 0, 1, - 1, for 

some integer j > 0, while maintaining the required accu- 
racy. If we replace v i by v^, where v i - , for 

i = 0, 1, ...,y - 1, v. = 0, for / ..., m - 1; and 

replace w by w , where w ■ = w i for i=j y . . ., m — 1 , and 
= 0, for i = 0, 1, ...,y - 1; and use v, w in step 5, we 
have 




Step 5': 


For each of these calculations, there are only two neigh- 
boring communications. 

3.0. Reduced PDD Algorithm 

The PDD algorithm has efficient communications. It 
achieves good load balance and is a good choice for solv- 
ing a large single system. However, for systems with 
multiple right sides, the PDD algorithm is competitive 
only with the conventional sequential algorithm, the 
Thomas algorithm (ref. 12). It is not necessarily superior 
to the Thomas algorithm for compact schemes and other 
applications when the order of the matrix is much larger 


AjcM = [ v, vv] 




V 


yik 

xW - £(*) - AxW 


(13) 


It only requires 4^ parallel operations. Replacing step 5 

of the PDD algorithm by step 5', we get the reduced PDD 
algorithm. The key question for the reduced PDD algo- 
rithm is how to find the smallest integer j > 0 that main- 
tains the required accuracy. 
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4.0. Accuracy Analysis 

The PDD algorithm reduces the communication from global to local. In addition to the reduced communication, the 
reduced PDD algorithm further reduces the computation. The PDD and reduced PDD algorithms are efficient because 
they have truncated communication and computation. However, this dropping may lead to an inaccurate solution. Thus, 
an accuracy study is essential in applying the PDD algorithm. Some preliminary study of the accuracy of the PDD algo- 
rithm has been done (refs. 5 and 13); however, the study is for general cases and only provides sufficient conditions to 
guarantee a given accuracy. Unfortunately, the conditions given in references 5 and 13 are difficult to verify, and the 
accuracy bound given is quite loose. A practical, tight error bound is given in reference 2 for a class of tridiagonal sys- 
tems, symmetric Toeplitz tridiagonal (STT) systems. A matrix is Toeplitz if its entries along each diagonal are the same. 
As a special class of tridiagonal systems, STT systems arise in many applications. For instance, the discretization matri- 
ces of the compact scheme (ref. 4) are STT systems. In this section, we extend the recent accuracy analysis on STT sys- 
tems to the reduced PDD algorithm. 

The accuracy analysis of the reduced PDD algorithm is three-fold: first we study the decay rate of the decaying 
elements v^)_ Wq\o < i < p - 1 ); second, we study the influence of dropping v$_ j, w^(0 </</?- 1 ) on the final 
solution, which is die accuracy analysis of the PDD algorithm; and third, the truncation of computation is studied, based 
on the accuracy analysis of the PDD algorithm. The accuracy analysis of the PDD algorithm gives the error bound of the 
reduced PDD algorithm. The error bound of the reduced PDD algorithm is a recent result. See the following analysis. 

A symmetric Toeplitz tridiagonal matrix has the form 


p A 
A, p A 


= [A, p, A] = A[ 1, c, 1 ] 


. A 
A p 


Let a and b be the real solutions of 


(14) 


b + a = c, b • a = 1 (15) 

where c is the diagonal element of matrix [1, c, 1] given by equation ( 14 ). Because we assume \c\ > 2 , we can further 
assume that \b\ < 1 and \a\ > 1 . For decay rate we have the result (ref. 2), 

which leads to theorem 1 . 

Theorem 1: For any diagonal-dominant, symmetric Toeplitz tridiagonal matrix , A - [A, p, A] if b m ~ l /a is less than 
machine accuracy, where a and b are the solutions of equation ( 15), the PDD algorithm approximates the true solution 
to within machine accuracy. 

Theorem 1 states that, if v m _ j, w 0 are less than machine accuracy, the PDD algorithm gives a satisfactory solution. 
In most scientific applications, the accuracy requirement is much weaker than machine accuracy. We need to study how 
the decay rate of v m _ j, w 0 influences the accuracy of the final solution. Let x be the exact solution of equation ( 1 ) and 
let ** be the corresponding final solution of the PDD algorithm. We have the error bound of the PDD algorithm in f 
norm (ref. 2): 


I* -**11 < M 

*11 “( 1 -WxM-i) 


Let v, w be the vectors defined in equation (13), V be the corresponding matrix in equation (6), consisting of all the 
2 {p - 1) vectors, and let x' be the solution of the reduced PDD algorithm. Then 


x = A- x d-A- l V(I + E T A- x V)E T A- x d 
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similar to the accuracy analysis of the PDD algorithm (ref. 2), we let y = (I + E T A~ l V)E T A~ x d. By equation (4) and 
equation (55) in reference 2, 


*'-** = (A~ l V -A~ x V)y = {A~ X V -A~ l V)E Tx 
Therefore, for a given integer j > 0, 


m - 1 


• \\E t \\ = II A-'V-A-'' 


V|| = 


1 

y 

(-*)'( 1 _ * 2 (m - 0) 

X(a + b'Lf~ x b li ) 

La 

i ~ J 

1 -* 2 


Since 


m- 1 


* = J 


b‘( 1 _fo2(m-/)) 


1 - b 2 


|l-* 2 l 


(m- 1 m 1 

£ 1*1' + X 

^ i = j i = j 


l*l'(i-l*i w ) . ,,, m i-i*i m - 7 +1 
1 - 1*1 11 i - 1*1 


i-fo 2 l 

i*U(i -i*r)+i*i m (i -i*i m_2+1 ) 


-* 2 Io-i*d 


\\X - X 31 


1 

I 1 -b 2 

t i*u( i - 1 *n + i*i*(i i b\j+ 1 b\ m 

Xa 

1 _fc2(m + i) 

|(i -* 2 )|(i -|*|) U(|o|-i)| 


By inequality (16), inequality (18) gives the error bound of the reduced PDD algorithm. 


\\x - X II . JC-JC* JC* - X 


1*1' 
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\ 

X 

|X|- 

b( 1 - b 2m ) 

(|fl|-l) 


V 

1 _£>2(m+ 1) 

) 


\b\j + \b\ m 

IMI«I - i)l 


For a given error tolerance e > 0, the right side of inequality (18) 

l*l w 



f 



X 

IM- 

*( i -b 2m ) 

(M - 1) 


l 

1 _ fr2(m + 1) 

/ 


j *u+j*r. 
|A.(W-DI 


if and only if 


log 


IM(M-l) 


1*1' 


W(W-i)| 


l + 1/ 

\x\- 

*( i -* 2m ) 

1 

1 


i _ b 2( - m + •) 

JjJJ 


J> 


log 1*1 


When 


< 1 


|£z£l < l*r (, , 1 1 , l*M 

lull "W(|a|-I)l UI-|*|J W(W-1) 


(17) 

(18) 


(19) 
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and we get a simpler inequality for the minimal number j 


loglXI(M-l) 


j > 


e -W”W<W-l)(i + i5^I 


log |i>! 


(20) 


When \b m \ is less than machine accuracy, inequality (19) becomes the same as inequality (20), and we have an even 
simpler formula: 


log|Xl(M - 1 )e 
log|fc| 


( 21 ) 


Inequality equation (21) gives a lower bound of the number of variables that need to be modified in equation (13) for a 
given error tolerance e > 0. Usually,; is quite small. For instance, when error tolerance £ equals 10 -4 ,; equals either 10 

or 7 when X, the magnitude of the off-diagonal elements, equals ^ or ^ , respectively, the diagonal elements being equal 
to 1 . The integer j reduces to 4 for 0 < X < - . 


5.0. Experimental Results 

The PDD and the reduced PDD algorithms were 
implemented on the 48-node IBM SP2 and 72-node Intel 
Paragon available at Langley Research Center. Both the 
SP2 and Paragon machines are distributed-memory 
parallel computers that adopt message-passing communi- 
cation paradigms and support virtual memory. Each pro- 
cessor (node) of the SP2 is either functionally equivalent 
to a reduced instruction set computer (RISC) System/ 
6000 desktop system (thin node) or a RISC System/6000 
deskside system (wide node). The Paragon XP/S super- 
computer uses the i860 XP microprocessor that includes 
a RISC integer core processing unit and three separate 
on-chip caches for page translation, data, and instruc- 
tions. The Langley SP2 has 48 wide nodes with 
128 Mbytes local memory and peak performance of 266 
million floating-point operations per second (MFLOPS) 
each. In contrast, the Langley Paragon has 72 nodes with 
32 Mbytes of local memory and peak performance of 


75 MFLOPS each. The heart of all distributed-memory 
parallel computers is the interconnection network that 
links the processors together. The SP2 high-performance 
switch is a multistage packet-switched Omega network 
that provides a minimum of four paths between any pair 
of nodes in the system. The Intel Paragon processors are 
connected in a two-dimensional (2-D) rectangular mesh 
topology. The diameter of the 2-D mesh topology 
increases with the number of processors. Communi- 
cation delay on a message-passing distributed-memory 
machine usually can be modeled by using two parame- 
ters, the latency (start time) a and transmission rate (in 
terms of transmission time per byte) (3. For the SP2, the 
latency is 30 psec and transmission rate is 2 psec. For 
Paragon, the latency is 46 psec and transmission rate is 
6 psec. 

Table 1 gives the computation and communication 
count of the PDD algorithm. The best conventional 
sequential algorithm for the LU decomposition method 


Table 1 . Computation and Communication Counts of PDD Algorithm 


System 

Matrix 

Best 

sequential 

PDD 

Computation 

Communication 

Single 

Nonperiodic 

r- 

1 

c 

GO 

17- -4 
P 

2a+ 12P 

Periodic 

14n - 16 

17- - 4 

P 

2a + 12(3 

Multiple 
right sides 

Nonperiodic 

(5/i - 3) • nl 

( 9 ;i 

(2a + 8«l • P) 

Periodic 

(7/i- 1) •n\ 


(2a -h 8/i 1 • (!) 
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for tridiagonal systems is the Thomas algorithm (ref. 14). 
For most distributed-memory computers, the time to 
communicate with nearest neighbors varies linearly with 
problem size. Let 5 be the number of bytes to be trans- 
ferred. Then the transfer time to communicate with a 
neighbor can be expressed as a 4* 5(3. Assuming 4 bytes 
are used for each real number, steps 3 and 4 of the PDD 
and reduced PDD algorithm take a + 8|3 and a + 4(3 
time, respectively, on any architecture that supports sin- 
gle array topology. Tridiagonal systems arising in both 
ADI and compact scheme methods, which are two 
widely used methods in CFD applications, are multiple 
right-side systems. They are usually “kernels” in much 
larger codes. The computation and communication 
counts for solving multiple right-side systems are listed 
in table 1, in which the factorization of matrix A and 
computation of Y are not considered (see eqs. (5) and (6) 
in Section 2). Parameter n\ is the number of right-hand 
sides. Note that, for multiple right-side systems, the com- 
munication cost increases with the number of right-hand 
sides. If the boundary conditions are periodic, the tridiag- 
onal systems arising in CFD applications are periodic 
tridiagonal systems. As shown in reference 2, the PDD 
algorithm, and consequently the reduced PDD algorithm, 
can be extended to solve periodic tridiagonal systems as 
well. Table 1 also lists computing and communication 
counts for solving periodic systems. 

Table 2 gives the computation and communication 
counts of the reduced PDD algorithm. As for the PDD 
algorithm, it has the same parallel computation and com- 
munication counts for both periodic and nonperiodic sys- 
tems. The computational saving of the reduced PDD 
algorithm is not only in step 5, the final modification 
step, but also in other steps. Because we only need j ele- 
ments of vectors v and w for the final modification in the 
reduced PDD algorithm (eq. (13) in section 3), we only 
need to compute j elements for each column of V in solv- 
ing equation (6). The integer j is given by equations (19), 


Table 2. Computation and Communication Counts of Reduced 
PDD Algorithm 


System 

Reduced PDD 

Computation 

Communication 

Single 

11- + 6y-4 
P 

2a+ 12P 

Multiple 
right sides 

(5^ + 4/+ 1 J»«l 

(2a + 8/1 1 »P) 


(20), or (21), depending on the particular circumstance. 
Notice that, when j < n! 2, the reduced PDD algorithm 
has a smaller operation count than that of the Thomas 
algorithm for periodic systems with multiple right-hand 
sides. 

While the accuracy analyses given in this study are 
for Toeplitz tridiagonal systems, the PDD algorithm and 
the reduced PDD algorithm can be applied for solving 
general tridiagonal systems. The computation counts 
given in tables 1 and 2 are for general tridiagonal sys- 
tems. For symmetric Toeplitz tridiagonal systems, a fast 
method proposed by Malcolm and Palmer (ref. 15) has a 
smaller computation count than the Thomas algorithm 
for systems with single right-hand sides. It requires only 
5n -h 2A: — 3 counts for arithmetic, where k is a decay 
parameter, depending on the diagonal dominancy of the 
system. Formulas are available to compute the upper and 
lower bounds of parameter k (ref. 15). The computation 
savings of Malcolm and Palmer’s method are in the LU 
decomposition. For systems with multiple right-hand 
sides, in which the factorization cost is not considered, 
the Malcolm and Palmer’s method and the Thomas 
method have the same computation count. Table 3 gives 
the computation and communication counts of the PDD 
and reduced PDD algorithms based on Malcolm and 


Table 3. Computation and Communication Counts for Symmetric Toeplitz Systems 


Algorithm 

Matrix 

Best 

sequential 

Parallel Algorithm 

Computation 


PDD 

Algorithm 

Nonperiodic 

5n + 2k - 3 

14— + 2k 

P 

2a + 12(3 

Periodic 

tin + 2*- 12 

14- +2* 

P 

2a + 12P 

Reduced 

PDD 

Algorithm 

Nonperiodic 

5/i + 2k - 3 

%- + 2k + 6j 
P 

2a + 8P 

Periodic 

\ln + 2k- 12 

S- + 2k + 6j 
P 

2a + 8P 
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Palmer’s algorithm. The computation counts of the two 
algorithms are reduced with the fast method used in solv- 
ing the subsystems. Table 3 shows the computation and 
communication counts for solving systems with a single 
right-hand side. For systems with multiple right-hand 
sides, the computation counts remain the same as in 
tables 1 and 2 for both the PDD and the reduced PDD 
algorithms, respectively. 


As an illustration of the algorithm and theoretical 
results given in previous sections, a sample matrix is 
tested here. This sample matrix is a periodic, symmetric, 
Toeplitz system 



1 

3 


A = 



1 

3 



( 22 ) 


which arises in the compact scheme. We have 


A = 



| • [1* 3, 1] 


1 

3 


([b, 1,0] x[0, a, 0] x [0, \,b]-AB) 


where A B is an n x n zero matrix, except that the first ele- 
ment on the first row is b, and 


X 




3- 75 
2 


(23) 


The reduced PDD algorithm was first implemented 
on a Sun workstation with double precision to solve the 
tridiagonal system Ax - d for accuracy checking. The 
right-side vector d was generated randomly. Figure 1 
depicts the accuracy comparison of the reduced PDD 
algorithm. The measured and predicted data have been 
converted to a common logarithm scale to make the 
difference visible. The x-coordinate is the order of 
matrix Aj, and the y-coordinate is the relative error in the 
1-norm. From figure 1, we can see that the accuracy anal- 
ysis provides a very good error bound. 

Speedup, one of the most frequently used perfor- 
mance metrics in parallel processing, is defined as 
sequential execution time over parallel execution time. 



Figure 1. Measured and predicted accuracy of reduced PDD 
algorithm. 

Parallel algorithms often exploit parallelism by sacrific- 
ing mathematical efficiency. To measure the true parallel 
processing gain, the sequential execution time should be 
based on a commonly used sequential algorithm. To dis- 
tinguish it from other interpretations of speedup, the 
speedup measured versus a commonly used sequential 
algorithm has been called absolute speedup (ref. 7). 
Another widely used interpretation is the relative 
speedup (ref. 7), which uses the uniprocessor execution 
time of the parallel algorithm as the sequential time. Rel- 
ative speedup measures the performance variation of an 
algorithm in terms of the number of processors and is 
commonly used in scalability studies. Both Amdahl’s 
law (ref. 16) and Gustafson’s scaled speedup (ref. 17) are 
based on relative speedup. In this study, we first use rela- 
tive speedup to study the scalability of the PDD and 
reduced PDD algorithms; then, we use the absolute 
speedup to compare these two algorithms with the con- 
ventionally used sequential algorithm. 

Because execution time varies with communication/ 
computation ratio on a parallel machine, the problem size 
is an important factor in performance evaluation, espe- 
cially for machines supporting virtual memory. Virtual 
address space separates the user logical memory from 
physical memory. This separation allows an extremely 
large virtual memory to be provided (with a much slower 
memory access time) on a sequential machine when only 
a small physical memory is available. If the problem size 
is larger than physical memory, data must be swapped in 
from and out to secondary memory, which may lead 
to inefficient sequential processing and unreasonably 
high speedup. If the problem size is too small, on the 
other hand, when the number of processors increases, 
the workload on each processor will drop quickly, 
which may lead to an extremely high communication/ 
computation ratio and unacceptably low performance. As 
studied in reference 18, the correct choice of initial prob- 
lem size is the problem size that reaches the asymptotic 
speed, the sustained uniprocessor speed corresponding to 
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the main memory access (ref. 1 8). The nodes of SP2 and 
Paragon have different processing powers and local 
memory sizes. For a fixed 1024 right sides, following the 
asymptotic speed concept, the order of matrix for SP2 
was found to be 6400, and the order of matrix for 
Paragon was found to be 1 600. Figures 2 and 3 show the 
measured speedup of the PDD algorithm when the large 
problem size n = 6400 is solved on Paragon and the 
small problem size n=1600 is solved on SP2. For 
comparison, ideal speedup, where speedup equals p 
when p processors are available, is also plotted with the 



Figure 2. Superlinear speedup with large problem size on Intel 
Paragon (1024 system of order 6400). 



Figure 3. Inefficient performance with small problem size on SP2 
(1024 system of order 1600). 


measured speedups. As indicated previously, the large 
problem size leads to an unreasonable superlinear 
speedup on Paragon, and the small problem size leads to 
a disappointingly low performance on SP2. 

From the problem size point of view, speedup can be 
divided into the fixed-size speedup and the scaled 
speedup. Fixed-size speedup fixes the problem size. 
Scaled speedup scales the problem size with the number 
of processors. Fixed-size speedup emphasizes how much 
execution time can be reduced for a given application 
with parallel processing. Amdahl’s law (ref. 16) is based 
on the fixed-size speedup. The scaled speedup concen- 
trates on exploring the computational power of parallel 
computers for solving otherwise intractable large prob- 
lems. Depending on the scaling restrictions of the prob- 
lem size, the scaled speedup can be classified as both the 
fixed-time speedup (ref. 17) and the memory-bounded 
speedup (ref. 19). As the number of processors increases, 
memory -bounded speedup scales problem size to utilize 
the associated memory increase. In general, operation 
count increases much faster than memory requirement. 
Therefore, the workload on each processor will not 
decrease with the increase in number of processors in 
memory-bounded speedup. Thus, scaled speedup is more 
likely to get a higher speedup than that of fixed- size 
speedup. 

Figures 4 and 5 depict the speedup of the fixed-size 
and memory-bounded speedup of the PDD and the 
reduced PDD algorithm, respectively, on the Intel Para- 
gon. From figures 4 and 5 we can see that the PDD and 
the reduced PDD algorithm have the same speedup pat- 
tern. This similarity is very reasonable because these two 
algorithms share the same computation and communica- 
tion pattern. It has been proven that the PDD algorithm, 
and therefore the reduced PDD algorithm, are perfectly 
scalable, in terms of isospeed scalability (ref. 20), on any 



Figure 4. Measured speedup of PDD algorithm on Intel Paragon 
(1024 system of order 1600). 
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Figure 5. Measured speedup of reduced PDD algorithm on Intel 
Paragon (1024 system of order 1600). 

architecture that supports the ring communication net- 
work. However, ring communication cannot be embed- 
ded in 2-D mesh topologies perfectly unless a wrap- 
around is supported. Thus, the communication cost of the 
algorithms increases slightly with the increase in the 
number of processors. The fact that the memory-bounded 
speedups on the Paragon are slightly below the ideal 
speedup is very reasonable. The influence of the commu- 
nication cost has been reflected in the measured speedup. 

Figure 6 demonstrates the speedups of the PDD 
algorithm on the SP2 machine. Because the one-to-one 
communication of the SP2 multistage Omega network 
does not increase with the number of processors, the 
PDD algorithm reaches the ideal memory-bounded 
speedup. In accordance with the isospeed metric 
(ref. 20), the PDD algorithm is perfectly scalable in the 
multistage SP2 machine. 

Although the PDD and reduced PDD have similar 
relative speedup patterns, the execution times of the two 
algorithms are very different. The reduced PDD algo- 
rithm has a smaller execution time than that of the PDD 
algorithm. For periodic systems the reduced PDD algo- 
rithm has an even smaller execution time than the con- 
ventional sequential algorithm. The timing of the 
Thomas algorithm, the PDD algorithm, and the reduced 
PDD algorithm on a single node of the SP2 and Paragon 
machine are listed in table 4. The problem size for all 
algorithms on SP2 is n = 6400 and n\ = 1024 and on 

Table 4. Sequential Timing (in seconds) on Paragon and SP2 
Machines 





PDD 

algorithm 

Reduced 

PDD 

algorithm 

Paragon 

1600 

0.8265 

0.9026 

0.6432 

SP2 


0.7387 

0.856 

0.5545 



Figure 6. Measured speedup of PDD algorithm on a SP-2 (1024 
system of order 6400). 



Figure 7. Speedup of PDD algorithm over Thomas algorithm 
(1024 systems of order 1600). 



Figure 8. Speedup of reduced PDD algorithm over Thomas algo- 
rithm (1024 systems of order 1600). 

Paragon is n = 1600 and n 1 = 1024. The measured results 
confirm the analytical results given in tables 1 and 2. 

Figures 7 and 8 show the speedup of the PDD and 
reduced PDD algorithms over the conventional sequen- 
tial algorithm, the Thomas algorithm, respectively. The 
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PDD algorithm increases computation count for high 
parallelism. The reduced PDD reduces computation 
count by taking advantage of diagonal dominance. Com- 
pared to the Thomas algorithm, while the absolute 
speedup of the PDD algorithm is worse than its relative 
speedup, the reduced PDD algorithm has a better abso- 
lute speedup than its relative speedup. The reduced 
PDD algorithm achieves a superlinear speedup over the 
Thomas algorithm. Experimental results confirm that the 
reduced PDD algorithm maintains the good scalability of 
the PDD algorithm and delivers an efficient performance 
in terms of execution time as well. 

6.0. Concluding Remarks 

Computational fluid dynamics (CFD) constantly 
demands higher computing power than is currently avail- 
able. With current technology, the feasible approach to 
continually increasing computing power seems to be 
through parallel processing: construct a computer that 
consists of hundreds and thousands of processors 
working concurrently. Parallel computers have become 
commercially available; however, unlike their sequential 
counterparts, efficient parallel algorithms require a high 
degree of parallelism and low cost of communication. 

Tridiagonal systems arise in many CFD applications. 
They are usually kernels in much larger codes. Although 
multiple tridiagonal systems are available in the larger 
codes, in many situations it is often more efficient to use 
a parallel tridiagonal solver for these systems than to 
remap data among processors to be able to perform a 
serial solve, especially for distributed memory machines 
where communication cost is high. 

Experimental and theoretical results show that both 
the PDD and reduced PDD algorithms are efficient and 
scalable, even for systems with multiple right sides. For 
periodic systems, as confirmed by our implementation 
results, the reduced PDD algorithm even has a smaller 
sequential execution time than that of the best sequential 
algorithm. The two algorithms are good candidates for 
parallel computers. 

The accuracy analysis and implementation given in 
this study are for Toeplitz systems. The PDD and 
reduced PDD algorithms, however, are applicable for 
general tridiagonal and narrow band linear systems. The 
common merit of these two algorithms is the minimum 
communication required, which makes them even more 
valuable in a distributed computing environment, such 
as the environment of a cluster of a network of 
workstations. 

NASA Langley Research Center 
Hampton, VA 23681-0001 
March 19, 1996 
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